Enable PDL in triple chevron launch #3282

bernhardmgruber · 2025-01-08T14:08:45Z

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.
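
For readers unfamiliar with this failure mode, here is a minimal sketch of how a feature-macro rename can silently turn a code path off. It is not the actual CCCL launch code: the `DEMO_*` macros and `launch_path` are hypothetical stand-ins, and the assumption is that one `#if` check kept testing the old name after the definition was renamed. Because an undefined macro is not an error inside `#if defined(...)`, the stale check compiles cleanly and simply selects the non-PDL branch.

```cpp
#include <string>

// Hypothetical stand-in for the feature macro *after* the rename.
// (In the PR, the real names are _THRUST_HAS_PDL -> _CCCL_HAS_PDL.)
#define DEMO_CCCL_HAS_PDL 1

// Stale check: still tests the pre-rename name, which nothing defines anymore.
// The preprocessor treats the missing macro as "feature absent", no diagnostic.
#if defined(DEMO_THRUST_HAS_PDL)
constexpr bool stale_check_uses_pdl = true;
#else
constexpr bool stale_check_uses_pdl = false;
#endif

// Fixed check: tests the name that is actually defined.
#if defined(DEMO_CCCL_HAS_PDL)
constexpr bool fixed_check_uses_pdl = true;
#else
constexpr bool fixed_check_uses_pdl = false;
#endif

// Hypothetical helper naming the launch path each check would select.
inline std::string launch_path(bool uses_pdl)
{
  return uses_pdl ? "pdl" : "plain";
}

// Both variants compile; only the second one enables the feature.
static_assert(!stale_check_uses_pdl, "stale macro name silently disables the feature");
static_assert(fixed_check_uses_pdl, "renamed macro enables the feature");
```

Since both variants build without warnings, a passing CI run alone would not reveal the disabled path, which fits the "by accident" description above.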

github-actions · 2025-01-08T16:40:01Z

🟩 CI finished in 2h 29m: Pass: 100%/94 | Total: 2d 10h | Avg: 37m 26s | Max: 1h 14m | Hits: 71%/12380

🟩 cub: Pass: 100%/46 | Total: 1d 11h | Avg: 46m 13s | Max: 1h 14m | Hits: 55%/3120

🟩 cpu
  🟩 amd64              Pass: 100%/44  | Total:  1d 09h | Avg: 45m 36s | Max:  1h 14m | Hits:  55%/3120  
  🟩 arm64              Pass: 100%/2   | Total:  1h 59m | Avg: 59m 51s | Max:  1h 04m
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total:  2h 00m | Avg: 17m 16s | Max: 45m 36s | Hits:  99%/780   
  🟩 12.5               Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 07m
  🟩 12.6               Pass: 100%/37  | Total:  1d 07h | Avg: 50m 33s | Max:  1h 14m | Hits:  40%/2340  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 57m | Avg: 58m 37s | Max: 59m 08s
  🟩 nvcc11.1           Pass: 100%/7   | Total:  2h 00m | Avg: 17m 16s | Max: 45m 36s | Hits:  99%/780   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 07m
  🟩 nvcc12.6           Pass: 100%/35  | Total:  1d 05h | Avg: 50m 05s | Max:  1h 14m | Hits:  40%/2340  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 37s | Max: 59m 08s
  🟩 nvcc               Pass: 100%/44  | Total:  1d 09h | Avg: 45m 39s | Max:  1h 14m | Hits:  55%/3120  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total:  2h 01m | Avg: 30m 15s | Max: 59m 16s
  🟩 Clang10            Pass: 100%/1   | Total: 54m 27s | Avg: 54m 27s | Max: 54m 27s
  🟩 Clang11            Pass: 100%/1   | Total: 53m 30s | Avg: 53m 30s | Max: 53m 30s
  🟩 Clang12            Pass: 100%/1   | Total: 58m 36s | Avg: 58m 36s | Max: 58m 36s
  🟩 Clang13            Pass: 100%/1   | Total: 58m 35s | Avg: 58m 35s | Max: 58m 35s
  🟩 Clang14            Pass: 100%/1   | Total: 53m 01s | Avg: 53m 01s | Max: 53m 01s
  🟩 Clang15            Pass: 100%/1   | Total: 54m 22s | Avg: 54m 22s | Max: 54m 22s
  🟩 Clang16            Pass: 100%/1   | Total: 53m 15s | Avg: 53m 15s | Max: 53m 15s
  🟩 Clang17            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
  🟩 Clang18            Pass: 100%/7   | Total:  5h 35m | Avg: 47m 55s | Max: 59m 08s
  🟩 GCC7               Pass: 100%/4   | Total:  3h 11m | Avg: 47m 56s | Max: 53m 42s
  🟩 GCC8               Pass: 100%/1   | Total: 54m 38s | Avg: 54m 38s | Max: 54m 38s
  🟩 GCC9               Pass: 100%/3   | Total:  1h 09m | Avg: 23m 08s | Max:  1h 00m
  🟩 GCC10              Pass: 100%/1   | Total: 54m 31s | Avg: 54m 31s | Max: 54m 31s
  🟩 GCC11              Pass: 100%/1   | Total: 55m 45s | Avg: 55m 45s | Max: 55m 45s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 36m | Avg: 32m 09s | Max: 56m 23s
  🟩 GCC13              Pass: 100%/8   | Total:  5h 46m | Avg: 43m 20s | Max:  1h 14m
  🟩 MSVC14.16          Pass: 100%/1   | Total: 15m 23s | Avg: 15m 23s | Max: 15m 23s | Hits:  99%/780   
  🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 05m | Avg:  1h 05m | Max:  1h 05m | Hits:  40%/780   
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 12m | Hits:  40%/1560  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 07m
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total: 15h 03m | Avg: 47m 34s | Max:  1h 01m
  🟩 GCC                Pass: 100%/21  | Total: 14h 29m | Avg: 41m 23s | Max:  1h 14m
  🟩 MSVC               Pass: 100%/4   | Total:  3h 38m | Avg: 54m 36s | Max:  1h 12m | Hits:  55%/3120  
  🟩 NVHPC              Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 07m
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 40m 05s | Avg: 20m 02s | Max: 24m 10s
  🟩 v100               Pass: 100%/44  | Total:  1d 10h | Avg: 47m 24s | Max:  1h 14m | Hits:  55%/3120  
🟩 jobs
  🟩 Build              Pass: 100%/39  | Total:  1d 08h | Avg: 49m 27s | Max:  1h 12m | Hits:  55%/3120  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 48s | Avg: 19m 48s | Max: 19m 48s
  🟩 GraphCapture       Pass: 100%/1   | Total:  1h 14m | Avg:  1h 14m | Max:  1h 14m
  🟩 HostLaunch         Pass: 100%/3   | Total: 53m 44s | Avg: 17m 54s | Max: 21m 21s
  🟩 TestGPU            Pass: 100%/2   | Total: 49m 20s | Avg: 24m 40s | Max: 24m 54s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 40m 05s | Avg: 20m 02s | Max: 24m 10s
  🟩 90a                Pass: 100%/1   | Total: 24m 05s | Avg: 24m 05s | Max: 24m 05s
🟩 std
  🟩 11                 Pass: 100%/5   | Total:  2h 33m | Avg: 30m 45s | Max: 52m 39s
  🟩 14                 Pass: 100%/3   | Total:  2h 08m | Avg: 42m 47s | Max: 59m 16s | Hits:  99%/780   
  🟩 17                 Pass: 100%/12  | Total: 10h 01m | Avg: 50m 08s | Max:  1h 07m | Hits:  40%/1560  
  🟩 20                 Pass: 100%/26  | Total: 20h 42m | Avg: 47m 46s | Max:  1h 14m | Hits:  40%/780

🟩 thrust: Pass: 100%/45 | Total: 22h 37m | Avg: 30m 09s | Max: 1h 08m | Hits: 77%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 45m 33s | Avg: 22m 46s | Max: 29m 03s
🟩 cpu
  🟩 amd64              Pass: 100%/43  | Total: 21h 32m | Avg: 30m 02s | Max:  1h 08m | Hits:  77%/9260  
  🟩 arm64              Pass: 100%/2   | Total:  1h 05m | Avg: 32m 41s | Max: 35m 13s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total:  1h 32m | Avg: 13m 12s | Max: 31m 56s | Hits:  99%/1852  
  🟩 12.5               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 57m 55s
  🟩 12.6               Pass: 100%/36  | Total: 19h 16m | Avg: 32m 07s | Max:  1h 08m | Hits:  71%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 00m | Avg: 30m 05s | Max: 31m 59s
  🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 32m | Avg: 13m 12s | Max: 31m 56s | Hits:  99%/1852  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 57m 55s
  🟩 nvcc12.6           Pass: 100%/34  | Total: 18h 16m | Avg: 32m 15s | Max:  1h 08m | Hits:  71%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 00m | Avg: 30m 05s | Max: 31m 59s
  🟩 nvcc               Pass: 100%/43  | Total: 21h 37m | Avg: 30m 10s | Max:  1h 08m | Hits:  77%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total:  1h 09m | Avg: 17m 21s | Max: 35m 17s
  🟩 Clang10            Pass: 100%/1   | Total: 36m 43s | Avg: 36m 43s | Max: 36m 43s
  🟩 Clang11            Pass: 100%/1   | Total: 30m 57s | Avg: 30m 57s | Max: 30m 57s
  🟩 Clang12            Pass: 100%/1   | Total: 30m 57s | Avg: 30m 57s | Max: 30m 57s
  🟩 Clang13            Pass: 100%/1   | Total: 31m 14s | Avg: 31m 14s | Max: 31m 14s
  🟩 Clang14            Pass: 100%/1   | Total: 31m 07s | Avg: 31m 07s | Max: 31m 07s
  🟩 Clang15            Pass: 100%/1   | Total: 33m 50s | Avg: 33m 50s | Max: 33m 50s
  🟩 Clang16            Pass: 100%/1   | Total: 33m 25s | Avg: 33m 25s | Max: 33m 25s
  🟩 Clang17            Pass: 100%/1   | Total: 35m 42s | Avg: 35m 42s | Max: 35m 42s
  🟩 Clang18            Pass: 100%/7   | Total:  2h 57m | Avg: 25m 22s | Max: 34m 42s
  🟩 GCC7               Pass: 100%/4   | Total:  1h 54m | Avg: 28m 33s | Max: 31m 56s
  🟩 GCC8               Pass: 100%/1   | Total: 36m 15s | Avg: 36m 15s | Max: 36m 15s
  🟩 GCC9               Pass: 100%/3   | Total: 42m 55s | Avg: 14m 18s | Max: 34m 08s
  🟩 GCC10              Pass: 100%/1   | Total: 33m 37s | Avg: 33m 37s | Max: 33m 37s
  🟩 GCC11              Pass: 100%/1   | Total: 34m 07s | Avg: 34m 07s | Max: 34m 07s
  🟩 GCC12              Pass: 100%/1   | Total: 37m 09s | Avg: 37m 09s | Max: 37m 09s
  🟩 GCC13              Pass: 100%/8   | Total:  3h 34m | Avg: 26m 46s | Max: 36m 41s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 19m 37s | Avg: 19m 37s | Max: 19m 37s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/1   | Total: 58m 07s | Avg: 58m 07s | Max: 58m 07s | Hits:  62%/1852  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 27m | Avg: 49m 19s | Max:  1h 08m | Hits:  74%/5556  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 57m 55s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  8h 30m | Avg: 26m 53s | Max: 36m 43s
  🟩 GCC                Pass: 100%/19  | Total:  8h 32m | Avg: 26m 58s | Max: 37m 09s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 45m | Avg: 45m 08s | Max:  1h 08m | Hits:  77%/9260  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 57m 55s
🟩 gpu
  🟩 v100               Pass: 100%/45  | Total: 22h 37m | Avg: 30m 09s | Max:  1h 08m | Hits:  77%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/39  | Total: 20h 58m | Avg: 32m 16s | Max:  1h 08m | Hits:  71%/7408  
  🟩 TestCPU            Pass: 100%/3   | Total: 37m 31s | Avg: 12m 30s | Max: 21m 35s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 01m | Avg: 20m 27s | Max: 32m 10s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total: 21m 21s | Avg: 21m 21s | Max: 21m 21s
🟩 std
  🟩 11                 Pass: 100%/5   | Total:  1h 23m | Avg: 16m 47s | Max: 27m 33s
  🟩 14                 Pass: 100%/3   | Total:  1h 26m | Avg: 28m 50s | Max: 35m 17s | Hits:  99%/1852  
  🟩 17                 Pass: 100%/12  | Total:  7h 04m | Avg: 35m 20s | Max: 58m 07s | Hits:  62%/3704  
  🟩 20                 Pass: 100%/23  | Total: 11h 57m | Avg: 31m 11s | Max:  1h 08m | Hits:  80%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 48s | Avg: 4m 54s | Max: 7m 34s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 48s | Avg:  4m 54s | Max:  7m 34s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 14s | Avg:  2m 14s | Max:  2m 14s
  🟩 Test               Pass: 100%/1   | Total:  7m 34s | Avg:  7m 34s | Max:  7m 34s

🟩 python: Pass: 100%/1 | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
	CUB
+/-	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 94)

#	Runner
69	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`


implement `add_sat` split `signed`/`unsigned` implementation, improve implementation for MSVC improve device `add_sat` implementation add `add_sat` test improve generic `add_sat` implementation for signed types implement `sub_sat` allow more msvc intrinsics on x86 add op tests partially implement `mul_sat` implement `div_sat` and `saturate_cast` add `saturate_cast` test simplify `div_sat` test Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. 
Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. * Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument 
addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix 
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add 
escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. 
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. 
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop CUB's util_compiler.cuh (#3302) All contained macros were deprecated Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement 
builtins for huge val, nan and nans * change `INFINITY` and `NAN` implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation updates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable 
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered by min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 move to c++17, finalize device optimization fix msvc compilation, update tests Deprecate C++11 and C++14 for libcu++ (#3173) * Deprecate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from 
floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates suppression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. 
* Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. 
The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fix thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix 
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add 
escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. 
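The failure mode described in #3282 (a guard still testing a macro name that no longer exists) can be sketched in plain host C++. This is only an illustration under stated assumptions: `DEMO_OLD_HAS_PDL` and `DEMO_NEW_HAS_PDL` are hypothetical stand-ins for the old and new feature macros, and `launch_path` is a made-up helper, not part of Thrust or CUB.

```cpp
#include <string>

// Hypothetical stand-in for the feature macro after the rename.
// Only the new name is defined; DEMO_OLD_HAS_PDL no longer exists anywhere.
#define DEMO_NEW_HAS_PDL 1

// Stale guard: still tests the old name. Testing an undefined macro with
// defined(...) is not an error, so the feature branch is dropped silently.
#if defined(DEMO_OLD_HAS_PDL)
constexpr bool stale_guard_enabled = true;
#else
constexpr bool stale_guard_enabled = false;
#endif

// Fixed guard: tests the name that is actually defined.
#if defined(DEMO_NEW_HAS_PDL)
constexpr bool fixed_guard_enabled = true;
#else
constexpr bool fixed_guard_enabled = false;
#endif

// Hypothetical helper: reports which launch path a guard would select.
inline std::string launch_path(bool pdl_enabled)
{
  return pdl_enabled ? "pdl" : "plain";
}

// Compile-time checks documenting the two outcomes.
static_assert(!stale_guard_enabled, "the stale guard silently disables the feature");
static_assert(fixed_guard_enabled, "the fixed guard enables the feature");
```

Because the preprocessor emits no diagnostic for the stale guard, a regression of this kind tends to show up only as a missing optimization, which may be why it went unnoticed during review.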
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 That is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add escape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. 
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` 
implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation updates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable 
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered by min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 Fix CI issues (#3443) update docs fix review restrict allowed types replace constexpr implementations with generic optimize `__is_arithmetic_integral`

Enable PDL in triple chevron launch

1b1e82f

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.
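
For readers unfamiliar with this failure mode, here is a minimal sketch of how a macro rename can silently turn a feature off. It assumes the feature is gated by a preprocessor check on the macro name. The `DEMO_` macro names and both helper functions are hypothetical stand-ins modeled on the names in the description above, not code from Thrust.

```cpp
#include <cassert>

// Stand-in for the configuration header after the rename: only the new name
// is defined. (Hypothetical macro, modeled on _CCCL_HAS_PDL.)
#define DEMO_CCCL_HAS_PDL 1

// Stale check: still tests the old name, which nothing defines any more.
// An undefined macro is not an error in a defined() test, so the fallback
// branch is selected without any diagnostic.
constexpr bool stale_check_uses_pdl()
{
#if defined(DEMO_THRUST_HAS_PDL)
  return true;
#else
  return false;
#endif
}

// Fixed check: tests the name that the configuration header actually defines.
constexpr bool fixed_check_uses_pdl()
{
#if defined(DEMO_CCCL_HAS_PDL)
  return true;
#else
  return false;
#endif
}

// Both facts hold at compile time.
static_assert(!stale_check_uses_pdl(), "stale macro name disables the feature");
static_assert(fixed_check_uses_pdl(), "current macro name enables the feature");
```

Because the stale branch still compiles and runs correctly, only slower or without the feature, a regression like this would typically not be caught by a test suite that checks results rather than which launch path was taken.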

bernhardmgruber requested review from a team as code owners January 8, 2025 14:08

bernhardmgruber requested review from alliepiper and gonidelis January 8, 2025 14:08

bernhardmgruber added the thrust For all items related to Thrust. label Jan 8, 2025

elstehle approved these changes Jan 8, 2025

bernhardmgruber enabled auto-merge (squash) January 8, 2025 14:33

miscco approved these changes Jan 8, 2025

bernhardmgruber merged commit 7ffdaac into NVIDIA:main Jan 8, 2025
111 checks passed

bernhardmgruber deleted the pdl branch January 8, 2025 16:52

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025

Enable PDL in triple chevron launch (NVIDIA#3282)

3f3f897

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.

github-actions bot commented Jan 8, 2025

🟩 cub: Pass: 100%/46 | Total: 1d 11h | Avg: 46m 13s | Max: 1h 14m | Hits: 55%/3120

🟩 thrust: Pass: 100%/45 | Total: 22h 37m | Avg: 30m 09s | Max: 1h 08m | Hits: 77%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 48s | Avg: 4m 54s | Max: 7m 34s

🟩 python: Pass: 100%/1 | Total: 25m 35s | Avg: 25m 35s | Max: 25m 35s

🏃‍ Runner counts (total jobs: 94)
